Description of dataset

The dataset describes the parameters involved in taking out a loan, specifically the applicant's features. Whether a loan is granted depends on applicant attributes such as home ownership (home_ownership), annual income, state of residence, and delinquencies on previous accounts. We produced several visualizations to understand the relationships between these features and applied machine learning models to predict the interest rate.

Issues with the data set

  1. There are 833 records with an empty ‘emp_title’, yet these rows still contain loan records. For these rows, emp_length is N/A.
  2. The emp_title ‘nail_tech’ has a salary of ‘0’, which does not make sense. There are also 23 records with an annual income of ‘0’ and 1 record with an annual income of ‘1’.
  3. The ‘verified_income’ and ‘verification_income_joint’ columns contain the values ‘verified’ and ‘non-verified’. There is also a ‘source verified’ value whose meaning is not specified, and it is unclear how it differs from ‘verified’.
  4. The meaning of ‘num_accounts_120d_past_due’ is unclear, as it takes only two values: ‘0’ and ‘N/A’.
  5. A value of 100% in ‘account_never_delinq_percent’ means that the applicant has never had a delinquency. However, 38 of these records also report delinquencies in delinq_2y, which indicates that the applicant was delinquent in the last 2 years. This contradiction needs further clarification.
  6. The ‘grade’ and ‘sub_grade’ columns need a description before they can be used for further data analysis.
  7. The ‘fees’ column is blank, and no data or description is provided for it.
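
Checks like the ones listed above can be reproduced with a few pandas filters. The following is a minimal sketch on a tiny made-up sample, not the actual dataset: the rows are hypothetical, annual_income is an assumed name for the annual income column, and missing values are assumed to be stored as NaN/None rather than empty strings.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample; the values are invented for illustration only.
# Column names follow the ones mentioned above; annual_income is an assumed name.
df = pd.DataFrame({
    "emp_title": ["nail tech", None, "teacher", None],
    "emp_length": [2.0, np.nan, 10.0, np.nan],
    "annual_income": [0.0, 45000.0, 60000.0, 1.0],
    "account_never_delinq_percent": [100.0, 95.0, 100.0, 100.0],
    "delinq_2y": [1, 0, 0, 0],
})


def audit(frame):
    """Count rows that match the data quality issues described above."""
    missing_title = frame["emp_title"].isna()
    never_delinq = frame["account_never_delinq_percent"] == 100
    return {
        # Issue 1: empty emp_title, and how often emp_length is missing too
        "missing_emp_title": int(missing_title.sum()),
        "missing_title_and_length": int(
            (missing_title & frame["emp_length"].isna()).sum()
        ),
        # Issue 2: implausibly low annual income (0 or 1)
        "income_at_most_1": int((frame["annual_income"] <= 1).sum()),
        # Issue 5: "never delinquent" rows that still report recent delinquencies
        "never_delinq_but_delinq_2y": int(
            (never_delinq & (frame["delinq_2y"] > 0)).sum()
        ),
    }


report = audit(df)
print(report)
```

Running the same function on the full table would give the counts needed to confirm or update the figures quoted in the list.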

3rd point: Create a feature set and a model that predicts interest_rate using at least 2 algorithms.

We can conclude that linear regression works better than the random forest model on this dataset, as its mean squared error is lower than that of the random forest.
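
The comparison above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the code used for the reported result: the features and target are synthetic stand-ins for the loan feature set and interest_rate, the split ratio and model settings are arbitrary, and RandomForestRegressor is used because interest_rate is a continuous target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 numeric features and a continuous target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 10 + X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# Hold out a test set so both models are scored on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record its mean squared error on the test set.
mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))

print(mse)
```

Because the synthetic target is linear by construction, the numbers printed here say nothing about the real dataset; the sketch only shows how the two errors are computed on the same held-out split.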

Propose enhancements to the model if given more time

Given more time, I would use hyperparameter tuning to find optimal parameters for the machine learning models, which can improve their predictive performance and reduce their computation time.
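
One possible way to do such tuning is a cross-validated grid search. The sketch below assumes scikit-learn's GridSearchCV with a random forest regressor; the data is synthetic, and the parameter grid is a hypothetical example rather than a recommended search space.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption), not the loan dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = 10 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=120)

# Hypothetical grid: every combination is evaluated with 3-fold cross-validation.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes, so MSE is negated
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_mse = -search.best_score_  # flip the sign back to a plain MSE
print(best_params, best_mse)
```

The grid size drives the cost: this example fits 4 parameter combinations across 3 folds, plus one final refit on all rows.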

Assumptions for this model